Stage 4 Report
https://github.com/anhaidgroup/py_entitymatching/blob/master/notebooks/vldb_demo/Demo_notebook_v6.ipynb



In [ ]:

    
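import py_entitymatching as em

# A and B (input tables) and C (candidate set after blocking) come from the earlier stages.
# Sample 450 candidate pairs from C
S = em.sample_table(C, 450)
# Label S interactively; the labels go into the 'label' column
S = em.label_table(S, 'label')
# Alternatively, load the pre-labeled data (this replaces the S labeled above)
S = em.read_csv_metadata('labeled_data_demo.csv', 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
print(len(S))
# Split S into I (70%, development set) and J (30%, evaluation set)
IJ = em.split_train_test(S, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']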



In [ ]:

    
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
nb = em.NBMatcher(name='NaiveBayes')
lg = em.LogRegMatcher(name='LogReg', random_state=0)
# Linear regression is not among the five methods reported below; kept for comparison only
ln = em.LinRegMatcher(name='LinReg')

Decision Tree



Random Forest

SVM

Naive Bayes

Logistic Regression

For each of the five learning methods (Decision Tree, Random Forest, SVM, Naive Bayes, Logistic Regression), report the precision, recall, and F1 obtained the first time you run cross-validation for these methods on I.
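The three numbers requested above can be computed directly from gold and predicted labels. The following is a minimal sketch in plain Python, independent of py_entitymatching; the function name `precision_recall_f1` and the toy labels are assumptions made for illustration and are not results from this project.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, f1) for binary labels where 1 means 'match'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual matches).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Returning 0.0 for an undefined ratio is one common convention; a report may prefer to flag such cases explicitly instead.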

Report which learning-based matcher you selected after that cross-validation.

Report all debugging iterations and cross-validation iterations that you performed. For each debugging iteration, report (a) which matcher you were trying to debug, and its precision/recall/F1, (b) what problems you found and what you did to fix them, and (c) the final precision/recall/F1 that you reached.
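A debugging iteration usually starts by looking at the pairs the matcher got wrong. As a rough sketch, assuming each labeled pair is a dictionary with the hypothetical keys `id`, `label` (gold), and `predicted`, the errors can be separated into false positives (which hurt precision) and false negatives (which hurt recall):

```python
def split_errors(pairs):
    """Partition labeled pairs into (false_positives, false_negatives).

    Each pair is a dict with the hypothetical keys 'id', 'label' and 'predicted',
    where 1 means 'match' and 0 means 'non-match'.
    """
    false_positives = [p for p in pairs if p["label"] == 0 and p["predicted"] == 1]
    false_negatives = [p for p in pairs if p["label"] == 1 and p["predicted"] == 0]
    return false_positives, false_negatives


# Toy pairs, invented for illustration only.
pairs = [
    {"id": 1, "label": 1, "predicted": 1},
    {"id": 2, "label": 0, "predicted": 1},  # predicted match, actually non-match
    {"id": 3, "label": 1, "predicted": 0},  # missed match
    {"id": 4, "label": 0, "predicted": 0},
]
fps, fns = split_errors(pairs)
print("false positives:", [p["id"] for p in fps])
print("false negatives:", [p["id"] for p in fns])
```

Inspecting each group separately makes it easier to say in the report whether a fix targeted precision or recall.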

For each cross-validation iteration, report (a) which matchers you were evaluating with cross-validation, and (b) the precision/recall/F1 of each.
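To make explicit what one cross-validation iteration measures, here is a small sketch of k-fold cross-validation in plain Python. It is not the py_entitymatching implementation: the "matcher" is a hypothetical one-feature threshold rule, and all function names and the toy similarity scores are assumptions made for illustration.

```python
import random


def prf(gold, predicted):
    """Precision, recall and F1 for binary labels (1 = match)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_threshold(scores, labels):
    """Hypothetical matcher: midpoint between mean match and mean non-match score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # fallback when one class is missing from the training folds
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def cross_validate(scores, labels, k=5, seed=0):
    """Return the mean (precision, recall, F1) over k held-out folds."""
    per_fold = []
    for held_out in k_fold_indices(len(scores), k, seed):
        held = set(held_out)
        train = [i for i in range(len(scores)) if i not in held]
        # Train on the other k-1 folds, then predict on the held-out fold.
        threshold = train_threshold([scores[i] for i in train],
                                    [labels[i] for i in train])
        predicted = [1 if scores[i] >= threshold else 0 for i in held_out]
        gold = [labels[i] for i in held_out]
        per_fold.append(prf(gold, predicted))
    return tuple(sum(m[j] for m in per_fold) / k for j in range(3))


# Toy similarity scores and labels, invented for illustration only.
scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.1, 0.2, 0.3, 0.15, 0.4]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = cross_validate(scores, labels, k=5)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Fixing the seed keeps the folds identical across iterations, so changes in the reported numbers reflect changes to the matcher or features rather than a different split.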

Report the final best learning-based matcher that you selected, and its precision/recall/F1.

Now report the following:
– For each of the five learning methods, train it on I, then report its precision/recall/F1 on J.
– For the final best matcher Y∗, train it on I, then report its precision/recall/F1 on J.

List the final set of features that you are using in your feature vectors.

Report an approximate time estimate: (a) how long it took to label the data, and (b) how long it took to find the best learning-based matcher.

Discuss why you could not reach higher precision, recall, and F1.